Georgios Petkos, CERTH / ITI, gpetkos@iti.gr [PRIMARY contact]
Konstantinos Moustakas, CERTH / ITI, moustak@iti.gr
Dimitrios Tzovaras, CERTH / ITI, Dimitrios.Tzovaras@iti.gr
For the purposes of the challenge, we created a custom application using
Processing. It offers two alternative but complementary visualizations of the
relationships between the genetic sequences. In the first, sequences are
positioned in 2D space according to their genetic similarity and resulting
disease characteristics using multidimensional scaling (MDS). In the second,
a minimum spanning tree is computed in order to obtain the most likely mutation
paths between all pairs of nodes. Among other interaction mechanisms, the user
can select nodes in the main visualization and examine their sequences in
an auxiliary visualization. For more details, please see the two-page summary.
The tool was built entirely by our team at CERTH / ITI; the only external
dependency is the MDSJ library, which performs the multidimensional scaling
analysis.
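The tree construction described above can be illustrated with a minimal sketch. It is not the code of the tool itself: it assumes that the dissimilarity between two sequences is their Hamming distance (the number of differing positions) and uses Prim's algorithm, and the sequence names and bases are hypothetical toy data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(seqs):
    """Prim's algorithm over the complete graph weighted by Hamming distance.

    seqs: dict mapping a sequence name to its base string.
    Returns a list of (tree_node, new_node, distance) edges.
    """
    names = list(seqs)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge joining a tree node to a node outside the tree.
        tree_node, new_node, dist = min(
            ((u, v, hamming(seqs[u], seqs[v]))
             for u in in_tree for v in names if v not in in_tree),
            key=lambda edge: edge[2],
        )
        in_tree.add(new_node)
        edges.append((tree_node, new_node, dist))
    return edges


# Hypothetical toy data, not the challenge sequences.
toy = {
    "native_A": "AAAAAAAA",
    "native_B": "AAAACCCC",
    "out_1": "AAAACCCG",
    "out_2": "AAAACCGG",
}
mst_edges = minimum_spanning_tree(toy)
```

In this toy example the two "outbreak" sequences attach to the tree through native_B, which is the pattern the tree visualization is meant to reveal.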
Video:
ANSWERS:
MC3.1:
What is the region or country of origin for the current outbreak? Please provide your answer as the name of the
native viral strain along with a brief explanation.
The origin is the Nigeria_B strain. The following figure shows the (minimum
spanning) tree representation. The black dots represent the native sequences,
whereas the red dots represent outbreak sequences. All outbreak sequences
belong to a single separate subtree, and the native node adjacent to that
subtree is the Nigeria_B strain.
Similarly, in the multidimensional scaling representation (next figure), the
outbreak sequences form a tight cluster whose closest native sequence is
Nigeria_B. Since all outbreak sequences are much closer to Nigeria_B than to
any other native sequence, the outbreak most likely originated from the
Nigeria_B strain.
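The "closest native sequence" argument can also be checked numerically. The following is a minimal sketch, assuming Hamming distance as the dissimilarity measure and the mean distance to all outbreak sequences as the ranking criterion; both are assumptions for illustration, and the names and sequences are hypothetical toy data.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_native(natives, outbreak):
    """Return (name, mean distance) of the native sequence nearest to the outbreak set."""
    def mean_distance(native_seq):
        total = sum(hamming(native_seq, seq) for seq in outbreak.values())
        return total / len(outbreak)

    scores = {name: mean_distance(seq) for name, seq in natives.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy data, not the challenge sequences.
natives = {"native_A": "AAAAAAAA", "native_B": "AAAACCCC"}
outbreak = {"out_1": "AAAACCCG", "out_2": "AAAACCGG"}
origin, mean_dist = closest_native(natives, outbreak)
```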
MC3.2:
Over time, the virus spreads and the diversity of the virus increases as it
mutates. Two patients infected with the
Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence
583. One patient has a strain identified
by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each
patient. Which patient likely contracted
the illness from Nicolai and why? Please
provide your answer as the sequence number along with a brief explanation.
The answer is 123. We use the tree representation and mark the sequences 583,
123 and 51. As can be seen in the following figure, sequence 123 is only a
single mutation step (link) away from 583, whereas three links separate 583
from 51. Therefore, the patient with strain 123 is more likely to have
contracted the illness from Nicolai. Additionally, the marked sequences are
displayed below the main visualization (only the positions where at least one
sequence differs from the rest are shown), and sequence 123 matches sequence
583 better than sequence 51 does.
The MDS-based representation gives a similar result, as displayed in the
following figure.
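Counting the links that separate two marked nodes amounts to a breadth-first search over the tree. The sketch below is illustrative only: the edge list is a hypothetical topology that reproduces the link counts reported above (one link from 583 to 123, three from 583 to 51), and the intermediate nodes "n1" and "n2" are invented labels, not sequences from the data set.

```python
from collections import deque


def links_between(edges, start, goal):
    """Number of tree links on the path from start to goal, or None if unreachable."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


# Hypothetical topology; "n1" and "n2" are placeholder intermediate nodes.
toy_edges = [("583", "123"), ("583", "n1"), ("n1", "n2"), ("n2", "51")]
links_to_123 = links_between(toy_edges, "583", "123")
links_to_51 = links_between(toy_edges, "583", "51")
```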
MC3.3:
Signs and symptoms of the Drafa virus are varied and humans react differently
to infection. Some mutant strains from
the current outbreak have been reported as being worse than others for the
patients that come in contact with them.
Identify
the top 3 mutations that lead to an increase in symptom severity (a disease
characteristic). The mutations involve
one or more base substitutions. For this
question, the biological properties of the underlying amino acid sequence
patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the
sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
Answer:
A → C, 268
G → C, 211
A → G, 222
(Please note that we start counting positions at 0.)
Using the tree-based visualization, we choose to display only symptom severity
from the disease characteristics. We then look for the subtrees with the most
intense red colors. One is located on the left, below the mutation from A to C
at position 268. Interestingly, when this mutation is highlighted, a second
branch with just one child is also highlighted, meaning that the same mutation
occurs at another point in the tree (see next figure). The occurrence of
another sequence with severe symptoms and the same mutation supports the
assumption that the mutation from A to C at position 268 significantly
increases symptom severity.
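Highlighting every branch that carries a given mutation can be sketched as a scan over the tree edges. This is a minimal illustration, not the tool's code: it assumes the tree is stored as directed (parent, child) pairs, positions are counted from 0 as in the answer above, and the tree and sequences are hypothetical toy data.

```python
def substitutions(parent_seq, child_seq):
    """Set of (from_base, to_base, position) differences, positions counted from 0."""
    return {
        (p, c, i)
        for i, (p, c) in enumerate(zip(parent_seq, child_seq))
        if p != c
    }


def edges_with_mutation(edges, seqs, mutation):
    """Every directed (parent, child) edge on which the given mutation occurs."""
    return [
        (parent, child)
        for parent, child in edges
        if mutation in substitutions(seqs[parent], seqs[child])
    ]


# Hypothetical toy tree rooted at "root"; not the challenge data.
seqs = {"root": "AAAA", "a": "ACAA", "b": "AAAG", "c": "ACAG"}
edges = [("root", "a"), ("root", "b"), ("b", "c")]
hits = edges_with_mutation(edges, seqs, ("A", "C", 1))
```

Here the substitution A → C at position 1 appears on two separate branches, mirroring the situation in which the same mutation is highlighted at two points of the tree.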
Similarly, it is easy to spot that another mutation that increases symptom
severity is the substitution from G to C at position 211, as displayed in the
following picture.
In a similar way, it is easy to spot the last mutation, at position 222 (from A
to G), marked by the red arrow in the picture above. This subtree is chosen
over the one marked by the blue arrow because it has a larger ratio of
sequences with severe symptoms.
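The comparison of two candidate subtrees by their ratio of severe sequences can be made explicit as follows. This is a sketch under stated assumptions: severity is assumed to be encoded as a number per node, a node counts as severe when it reaches a chosen threshold, and the node names, values, and threshold are all hypothetical.

```python
def subtree_nodes(children, root):
    """Collect root and all of its descendants with an iterative depth-first walk."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, ()))
    return found


def severe_ratio(children, severity, root, threshold):
    """Fraction of nodes in the subtree whose severity is at least the threshold."""
    nodes = subtree_nodes(children, root)
    severe = sum(1 for node in nodes if severity[node] >= threshold)
    return severe / len(nodes)


# Hypothetical subtrees named after the arrows in the figure; values are made up.
children = {"red": ["r1", "r2"], "blue": ["b1", "b2", "b3"]}
severity = {
    "red": 1.0, "r1": 1.0, "r2": 0.0,
    "blue": 1.0, "b1": 0.0, "b2": 0.0, "b3": 1.0,
}
red_ratio = severe_ratio(children, severity, "red", threshold=1.0)
blue_ratio = severe_ratio(children, severity, "blue", threshold=1.0)
```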
MC3.4:
Due to the rapid spread of the virus and limited resources, medical personnel
would like to focus on treatments and quarantine procedures for the worst of
the mutant strains from the current outbreak, not just symptoms as in the
previous question. To find the most
dangerous viral mutants, experts are monitoring multiple disease
characteristics.
Consider
each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the
most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous
strain could cause severe symptoms, have high mortality, cause major
complications, exhibit resistance to anti viral drugs, and target high risk
groups. For this question, the
biological properties of the underlying amino acid sequence patterns are not
significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in the
sequence (left to right) where the base substitutions occurred. For example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
Answer:
A → T, 945
A → G, 222
A → G, 820
The process is similar to that of the previous question. Only the aggregate
characteristic is displayed, and again we look for the subtrees with the most
intense coloring. We can use both the tree and the MDS representation along
with the tree connectivity. The following figure displays the visualization
using MDS. As can be seen, the first mutation that significantly increases the
disease characteristics, from A to T at position 945, is easy to spot.
The other significant mutations are harder to see in the MDS representation,
so we switch to the regular tree representation. As can be seen in the
following figure, one is the change from A to G at position 222, as in the
previous question, and the last one is the change from A to G at position 820
(the red arrow in the next figure).
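One way to form an aggregate characteristic in which all disease characteristics count equally is an unweighted mean. The sketch below assumes that each characteristic has already been scaled to the range [0, 1]; the field names, the scaling, and the two example strains are hypothetical and are not taken from the data set or from the tool.

```python
# Hypothetical field names for the five characteristics named in the question.
CHARACTERISTICS = (
    "symptoms",
    "mortality",
    "complications",
    "drug_resistance",
    "at_risk",
)


def aggregate_score(record):
    """Equal-weight mean of the characteristics, each assumed scaled to [0, 1]."""
    return sum(record[name] for name in CHARACTERISTICS) / len(CHARACTERISTICS)


# Made-up example strains.
strains = {
    "s1": {"symptoms": 1.0, "mortality": 1.0, "complications": 1.0,
           "drug_resistance": 1.0, "at_risk": 1.0},
    "s2": {"symptoms": 1.0, "mortality": 0.0, "complications": 0.5,
           "drug_resistance": 0.0, "at_risk": 0.0},
}
scores = {name: aggregate_score(record) for name, record in strains.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # most dangerous first
```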